Character-Based Pivot Translation for Under-Resourced Languages and Domains
Abstract
In this paper we investigate the use of character-level translation models to support translation from and into under-resourced languages and textual domains via closely related pivot languages. Our experiments show that these low-level models can be successful even with tiny amounts of training data. We test the approach on movie subtitles for three language pairs and on legal texts for another language pair in a domain adaptation task. Our pivot translations outperform the baselines by a large margin.
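As a rough illustration of the idea described in the abstract, the sketch below shows one common way to prepare text for a character-level model (each character becomes a token, with a marker for word boundaries) and how two translation steps can be chained through a pivot language. The abstract does not specify the segmentation scheme or the systems used, so the function names, the `_` boundary symbol, and the placeholder translators are assumptions for illustration only.

```python
def to_char_level(sentence: str, space_symbol: str = "_") -> str:
    """Split a sentence into space-separated characters.

    Word boundaries are marked with `space_symbol` (an assumed convention),
    so "god dag" becomes "g o d _ d a g".
    """
    words = sentence.split()
    return f" {space_symbol} ".join(" ".join(word) for word in words)


def from_char_level(tokens: str, space_symbol: str = "_") -> str:
    """Invert to_char_level: join characters and restore word boundaries."""
    return "".join(" " if tok == space_symbol else tok for tok in tokens.split())


def pivot_translate(sentence, source_to_pivot, pivot_to_target):
    """Translate via a pivot language by composing two translation steps.

    Both arguments are hypothetical callables (str -> str); in practice one
    of them could be a character-level model between closely related languages.
    """
    return pivot_to_target(source_to_pivot(sentence))


if __name__ == "__main__":
    segmented = to_char_level("god dag")
    print(segmented)
    print(from_char_level(segmented))
```

Note that this simple scheme would confuse a literal `_` in the input with the boundary marker; a real pipeline would need to escape it or choose a symbol that cannot occur in the data.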
Similar resources
Employing Pivot Language Technique through Statistical and Neural Machine Translation Frameworks: the Case of Under-resourced Persian-Spanish Language Pair
The quality of Neural Machine Translation (NMT) systems, like that of Statistical Machine Translation (SMT) systems, depends heavily on the size of the training data set, while for some language pairs high-quality parallel data are scarce. To address this low-resource training data bottleneck, we employ the pivoting approach in both neural MT and statistical MT frameworks. ...
Translating from under-resourced languages: comparing direct transfer against pivot translation
In this paper we compare two methods for translating into English from languages for which few MT resources have been developed (e.g. Ukrainian). The first method involves direct transfer using an MT system that is available for this language pair. The second method involves translation via a cognate language, which has more translation resources and one or more advanced translation systems (e....
Creation of comparable corpora for English-Urdu, Arabic, Persian
Statistical Machine Translation (SMT) relies on the availability of rich parallel corpora. However, in the case of under-resourced languages or some specific domains, parallel corpora are not readily available. This leads to under-performing machine translation systems in those sparse data settings. To overcome the low availability of parallel resources the machine translation community has rec...
Analysis and Evaluation of Comparable Corpora for Under Resourced Areas of Machine Translation
Lack of sufficient linguistic resources and parallel corpora for many languages and domains is currently one of the major obstacles to further advancement of automated translation. The solution proposed in this paper is to exploit the fact that non-parallel bi- or multilingual text resources are much more widely available than parallel translation data. This position paper presents previous resea...
RBMT as an alternative to SMT for under-resourced languages
Although SMT (Statistical Machine Translation) has recently revolutionised MT for major language pairs, it still faces difficulties when addressing under-resourced and, to some extent, mildly-resourced languages, such as the need for large quantities of parallel texts, the limited guarantee of quality, etc. We thus speculate that RBMT (Rule Based Machine Translation) can fill the gap for ...